Improving Data Quality: Consistency and Accuracy
Authors
Abstract
Two central criteria for data quality are consistency and accuracy. Inconsistencies and errors in a database often emerge as violations of integrity constraints. Given a dirty database D, one needs automated methods to make it consistent, i.e., find a repair D′ that satisfies the constraints and “minimally” differs from D. Equally important is to ensure that the automatically-generated repair D′ is accurate, or makes sense, i.e., D′ differs from the “correct” data within a predefined bound. This paper studies effective methods for improving both data consistency and accuracy. We employ a class of conditional functional dependencies (CFDs) proposed in [6] to specify the consistency of the data, which are able to capture inconsistencies and errors beyond what their traditional counterparts can catch. To improve the consistency of the data, we propose two algorithms: one for automatically computing a repair D′ that satisfies a given set of CFDs, and the other for incrementally finding a repair in response to updates to a clean database. We show that both problems are intractable. Although our algorithms are necessarily heuristic, we experimentally verify that the methods are effective and efficient. Moreover, we develop a statistical method that guarantees that the repairs found by the algorithms are accurate above a predefined rate without incurring excessive user interaction.
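To make the idea of a CFD concrete, the following is a minimal sketch of violation detection only. It is not the repair algorithm described in the abstract. It assumes a CFD is given as a left-hand side, a right-hand side, and a single pattern tuple in which "_" is a wildcard. The function names (`matches`, `cfd_violations`), the attribute names, and the sample rows are all hypothetical and chosen purely for illustration.

```python
from itertools import combinations

WILDCARD = "_"  # assumed notation for "any value" in a pattern tuple


def matches(row, attrs, pattern):
    """True if row agrees with the pattern on every attribute in attrs."""
    return all(pattern[a] == WILDCARD or row[a] == pattern[a] for a in attrs)


def cfd_violations(rows, lhs, rhs, pattern):
    """Return index tuples of rows that violate the CFD (lhs -> rhs, pattern).

    A single row violates the CFD if it matches the pattern on lhs but not
    on rhs. A pair of rows violates it if both match the pattern on lhs,
    agree on all lhs values, and disagree on some rhs value.
    """
    # Only rows selected by the pattern's condition are constrained at all.
    selected = [i for i, r in enumerate(rows) if matches(r, lhs, pattern)]

    violations = [(i,) for i in selected if not matches(rows[i], rhs, pattern)]

    for i, j in combinations(selected, 2):
        same_lhs = all(rows[i][a] == rows[j][a] for a in lhs)
        same_rhs = all(rows[i][a] == rows[j][a] for a in rhs)
        if same_lhs and not same_rhs:
            violations.append((i, j))
    return violations


# Hypothetical data: "zip determines street" is required only when cc == "44".
rows = [
    {"cc": "44", "zip": "EH4 1DT", "street": "Mayfield"},
    {"cc": "44", "zip": "EH4 1DT", "street": "Crichton"},
    {"cc": "01", "zip": "07974", "street": "Mtn Ave"},
    {"cc": "01", "zip": "07974", "street": "Main St"},
]
pattern = {"cc": "44", "zip": WILDCARD, "street": WILDCARD}

print(cfd_violations(rows, ["cc", "zip"], ["street"], pattern))  # [(0, 1)]
```

In this sketch, rows 2 and 3 are not flagged because the condition cc = "44" does not select them, whereas an unconditional functional dependency from (cc, zip) to street would flag them as well. A pattern with a constant on the right-hand side can additionally flag a single row on its own, which a traditional functional dependency cannot do.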
Similar resources
A Questionnaire-based Data Quality Methodology
Data quality (DQ) has been defined as “fitness for use” of the data (also called Information Quality). A single aspect of data quality is defined as a “dimension” such as “consistency”, “accuracy”, “completeness”, or “timeliness”. In order to assess and improve data quality, “methodologies” have been defined. Data quality methodologies are sets of guidelines and techniques that are designed for...
Consistency-Aware Search for Word Alignment
As conventional word alignment search algorithms usually ignore the consistency constraint in translation rule extraction, improving alignment accuracy does not necessarily increase translation quality. We propose to use coverage, which reflects how well extracted phrases can recover the training data, to enable word alignment to model consistency and correlate better with machine translation. ...
A Metrics-Driven Approach for Quality Assessment of Linked Open Data
The main objective of the Web of Data paradigm is to crystallize knowledge through the interlinking of already existing but dispersed data. The usefulness of the developed knowledge depends strongly on the quality of the published data. Researchers have observed many deficiencies with regard to the quality of Linked Open Data. The first step towards improving the quality of data released as a p...
Cyber Forensics Assurance
As the usage of Cyber Forensics increases, so does the potential for errors in the practice of applying Cyber Forensics. Errors in opinions derived from faulty practices have resulted in grievous miscarriages of justice. However, utilizing the foundations of Information Systems Assurance and Information Quality, a solid foundation for improving the quality and effectiveness of Cyber Forensics ca...
Improved quality monitoring of multi-center acupuncture clinical trials in China
BACKGROUND: In 2007, the Chinese Science Division of the State Administration of Traditional Chinese Medicine (TCM) convened a special conference to discuss quality control for TCM clinical research. Control and assurance standards were established to guarantee the quality of clinical research. This paper provides practical guidelines for implementing strict and reproducible quality control for a...